So far we’ve seen the magic of learning in
• Neurons, organisms, algorithms, and materials


• But what is it that makes ML so special and powerful?
• And have we seen examples of “mathematics that learns” before?
• Concerns, Limitations and Open Problems in ML


What Makes ML So Special? 


• Although there is no single answer, and it may vary from one ML method to another, some broader aspects include...


The Power of Universal Approximation...


• A NN with 1 hidden layer can represent:
  - any bounded continuous function (to arbitrary accuracy ε)
    * Universal Approximation Theorem [Cybenko 1989]
  - any Boolean function (exactly)


The Power of Universal Approximation... 


Theorem (Cybenko)


Let σ be any continuous discriminatory function.
Then finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^T x + b_j), \quad \text{where } w_j \in \mathbb{R}^n,\ \alpha_j, b_j \in \mathbb{R},$$

are dense in $C(I_n)$.


In other words, given any ε > 0 and f ∈ C(I_n), there is a sum
G(x) of the above form such that

$$|G(x) - f(x)| < \varepsilon, \quad \forall x \in I_n.$$

But Why?


ES691 - Mathematics for Machine Learning / Dr. Naveed R. Butt @ 
GIKI - FES 


We all know about piece-wise linear approximations.

[Figure: plot of an example function f(x) together with its piece-wise approximation]

...we could split any continuous function into several regions
and approximate its value in each region by the height of a
rectangle, and still get arbitrarily close to the function values
(as long as we have a sufficient number of regions)!
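
The rectangle idea can be tried numerically. The following is a minimal sketch, not taken from the slides: it assumes sin(x) on [0, π] as the target function, and the helper names `piecewise_constant` and `max_error` are hypothetical. Each region is approximated by the function value at its midpoint, and the worst-case error on a sample grid is reported as the number of regions grows.

```python
import math


def piecewise_constant(f, a, b, n_regions):
    """Return g(x) that equals f at the midpoint of the region containing x."""
    width = (b - a) / n_regions
    heights = [f(a + (i + 0.5) * width) for i in range(n_regions)]

    def g(x):
        # Clamp so that x == b falls into the last region
        i = min(int((x - a) / width), n_regions - 1)
        return heights[i]

    return g


def max_error(f, g, a, b, n_samples=2001):
    """Largest |f(x) - g(x)| over a uniform grid of sample points."""
    xs = [a + (b - a) * k / (n_samples - 1) for k in range(n_samples)]
    return max(abs(f(x) - g(x)) for x in xs)


f = math.sin  # illustrative target (an assumption, not the slide's f)
errors = {}
for n in (4, 16, 64, 256):
    g = piecewise_constant(f, 0.0, math.pi, n)
    errors[n] = max_error(f, g, 0.0, math.pi)
    print(f"{n:4d} regions -> max error {errors[n]:.4f}")
```

The error shrinks roughly in proportion to the region width, which is the sense in which enough rectangles get arbitrarily close.
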


[Figure: “Output from top hidden neuron”, a step-like curve whose step sits at x = -b/w = 0.50 (with b = -40), and “Weighted output from hidden layer”, built by adding such steps from several hidden neurons]

Source: neuralnetworksanddeeplearning.com (a MUST VISIT interactive page by Michael Nielsen to see all this in action)

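
The step position -b/w quoted in the figure can be reproduced with a single sigmoid neuron. This is a rough sketch under stated assumptions: the weight w = 80 is inferred from b = -40 and -b/w = 0.50, while the `bump` helper, its steepness w = 1000, and the example height 0.7 are illustrative choices rather than values confirmed by the slides. Two steep neurons with opposite output weights form one "rectangle".

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hidden_neuron(x, w, b):
    """Output of one hidden neuron: sigmoid(w * x + b)."""
    return sigmoid(w * x + b)


def bump(x, left, right, height, w=1000.0):
    """Two hidden neurons with output weights +height and -height."""
    step_up = hidden_neuron(x, w, -w * left)      # step located at x = left
    step_down = hidden_neuron(x, w, -w * right)   # step located at x = right
    return height * step_up - height * step_down


w, b = 80.0, -40.0          # w is inferred from b = -40 and -b/w = 0.50
step_position = -b / w      # where the neuron output crosses 0.5
print(step_position, hidden_neuron(0.4, w, b), hidden_neuron(0.6, w, b))
print(bump(0.5, 0.4, 0.6, 0.7), bump(0.2, 0.4, 0.6, 0.7))
```

Summing many such bumps, one per region, gives the rectangle-style approximation described earlier.
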
• But Universal Approximation is only part of the story...
• Why? Because we have so many other universal approximators...
  - E.g., Polynomials, Power Series, Fourier Series...


...other “Universal” Approximators


$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_2 x^2 + a_1 x + a_0$$

[Figure: curves labelled sin(x) and f(x), plotted from -2π to 2π]


Weierstrass Approximation Theorem
• If f(x) is a continuous real-valued function on [a, b], then for any ε > 0 there exists a polynomial P_n on [a, b] such that

$$|f(x) - P_n(x)| < \varepsilon$$

for all x ∈ [a, b].
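
One classical constructive route to the Weierstrass theorem uses Bernstein polynomials. The sketch below is an illustration rather than part of the lecture: it assumes the interval [0, 1] and the test function |x - 0.5| (continuous but not differentiable), and the helper name `bernstein` is hypothetical. It shows the worst-case error on a grid falling as the degree n grows.

```python
import math


def bernstein(f, n):
    """Degree-n Bernstein polynomial of f on [0, 1]."""
    # Pre-multiply the samples f(k/n) by the binomial coefficients
    weights = [f(k / n) * math.comb(n, k) for k in range(n + 1)]

    def p(x):
        return sum(wk * x**k * (1 - x) ** (n - k) for k, wk in enumerate(weights))

    return p


def f(x):
    return abs(x - 0.5)  # illustrative target, chosen for this sketch


grid = [i / 200 for i in range(201)]
sup_errors = {}
for n in (4, 16, 64, 256):
    p = bernstein(f, n)
    sup_errors[n] = max(abs(f(x) - p(x)) for x in grid)
    print(f"degree {n:3d}: max error {sup_errors[n]:.4f}")
```

Convergence here is slow, which hints at why the existence of an approximating polynomial says little about how efficiently one can be found.
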


Taylor's theorem: Let k ≥ 1 be an integer and let the function f: R → R be k times
differentiable at the point a ∈ R. Then there exists a function h_k: R → R such that

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x - a)^k + h_k(x)(x - a)^k,$$

and

$$\lim_{x \to a} h_k(x) = 0.$$

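
As a small worked instance of Taylor's theorem, the sketch below expands sin(x) about a = 0 and measures the error at x = 1 for increasing k. The choice of function, expansion point and evaluation point are assumptions made for illustration; `taylor_sin` is a hypothetical helper name.

```python
import math


def taylor_sin(x, k):
    """Taylor polynomial of sin about a = 0, keeping terms up to degree k."""
    total = 0.0
    for n in range(k + 1):
        # The derivatives of sin at 0 cycle through 0, 1, 0, -1
        deriv_at_zero = (0.0, 1.0, 0.0, -1.0)[n % 4]
        total += deriv_at_zero * x**n / math.factorial(n)
    return total


x = 1.0
taylor_errors = {k: abs(math.sin(x) - taylor_sin(x, k)) for k in (1, 3, 5, 7)}
for k, err in taylor_errors.items():
    print(f"k = {k}: |sin(1) - T_k(1)| = {err:.2e}")
```
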
Fourier series, sine-cosine form

$$s_N(x) = A_0 + \sum_{n=1}^{N} \left( A_n \cos\left(\frac{2\pi n x}{P}\right) + B_n \sin\left(\frac{2\pi n x}{P}\right) \right)$$


Inverse transform

$$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{i 2\pi \xi x}\, d\xi, \quad \forall x \in \mathbb{R}$$

The inverse transform is a representation of f(x) as a weighted summation of complex exponential functions.


Fourier transform

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-i 2\pi \xi x}\, dx$$

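
The sine-cosine form can be evaluated directly. The following sketch assumes a square wave of period P = 2π (equal to +1 on (0, π) and -1 on (π, 2π)), for which A_n = 0 and B_n = 4/(πn) for odd n; this target is an illustrative choice, not one from the slides. It checks how the partial sum at x = π/2 approaches the true value 1 as N grows.

```python
import math


def fourier_partial_sum(x, N, P=2 * math.pi):
    """s_N(x) for the square wave: A_n = 0, B_n = 4 / (pi * n) for odd n."""
    total = 0.0
    for n in range(1, N + 1):
        B_n = 4.0 / (math.pi * n) if n % 2 == 1 else 0.0
        total += B_n * math.sin(2 * math.pi * n * x / P)
    return total


x = math.pi / 2  # the square wave equals 1 here
fourier_errors = {N: abs(1.0 - fourier_partial_sum(x, N)) for N in (1, 9, 99)}
for N, err in fourier_errors.items():
    print(f"N = {N:2d}: |1 - s_N(pi/2)| = {err:.4f}")
```
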


In fact, approximation (the ability to memorize and fit
available data) alone is not sufficient.


• The power of “generalization” (interpolation, extrapolation,
fitting to new data, learning of a general class) is critical,
and modern ML seems to do exceptionally well at this.


The Power of OVERPARAMETRIZATION... (and how it may help “Generalization”)


Occam’s Razor


Pluralitas non est ponenda sine necessitate

• “Plurality should not be posited without necessity.”


In Mathematical Modeling Language: “Model parameters should not be increased without necessity”.


But...

- Overparameterization seems to help Neural Networks
- Why? Open question, with several theories...

Hypothesis: Overparameterization Leads to Sparsity


Overparameterization lets the learning algorithm choose the
best options (parameters) among data-dependent couplings and
essentially “discard” the rest (leading to sparsity).


Lottery Ticket Hypothesis


Among many random initializations, some turn out to be the
fastest to train towards optima (recall Ant Colony, random
directions taken at first).


[Figure: the original “dense” network (left) and its “pruned” subnet (right)]

Both give very similar performance if the subnet is initialized with the same weights
that the original network was initialized with when it successfully learned.

Overparameterization and Random Initializations Help the Weight-Update Algorithm


- Under certain conditions, Gradient Descent has been shown to provably optimize
overparametrized NNs.


Algorithmic Stability Hypothesis


- Algorithms that do not vary too much with slight changes in
training datasets generalize better (possibly indicating that
they have learned underlying distributions rather than just the
data).

Manifold Hypothesis


- Many high-dimensional data sets that occur in the real world
actually lie along low-dimensional latent manifolds inside that
high-dimensional space.

- As a consequence, many data sets that initially appear to require
many variables to describe can actually be described by a
comparatively small number of variables.

- Machine learning models only have to fit relatively simple, low-dimensional,
highly structured subspaces within their potential
input space (latent manifolds).

- Within one of these manifolds, it is always possible to interpolate
between two inputs, that is to say, morph one into another via a
continuous path along which all points fall on the manifold.

Fig. 1. Example of high-dimensional data lying in low-dimensional subspaces. It is
seen that rather than uniformly distributed in the 3-dimensional space, these data
points lie on the union of two lines and one plane.

Fig. 2. Data is often embedded in (lies on) a lower-dimensional structure or manifold.
It should be possible to characterize the data and the relationship between individual
points using fewer dimensions, if we were able to measure distances
on the manifold itself instead of in Euclidean space.

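
A toy example makes the interpolation point concrete. The sketch below assumes the simplest possible manifold, a unit circle in the plane, which is described by a single latent variable (the angle). It is an illustration chosen for this note, not data from the lecture. Interpolating the latent variable stays on the manifold, while interpolating the raw coordinates leaves it.

```python
import math


def on_circle(theta):
    """Map the 1-D latent variable theta to a point in the 2-D ambient space."""
    return (math.cos(theta), math.sin(theta))


def distance_from_manifold(p):
    """How far a 2-D point is from the unit circle."""
    return abs(math.hypot(p[0], p[1]) - 1.0)


theta_a, theta_b = 0.0, math.pi / 2
a, b = on_circle(theta_a), on_circle(theta_b)

t = 0.5
# Interpolate the single latent variable, then map back to the plane
latent_mid = on_circle((1 - t) * theta_a + t * theta_b)
# Interpolate the two ambient coordinates directly
euclid_mid = ((1 - t) * a[0] + t * b[0], (1 - t) * a[1] + t * b[1])

print(distance_from_manifold(latent_mid))  # essentially zero
print(distance_from_manifold(euclid_mid))  # clearly off the circle
```
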
• Of course, we could get ourselves in trouble (e.g.,
computationally) with heavy overparameterization!!
• Next, let’s look at some other aspects that help in this regard.


The Power of LOTS of (REPEATED) SIMPLE UNITS with SIMPLE RULES (with localized “gating” of information...)


[Figure: a single neuron with its inputs, sum and bias, in three steps: 1. weigh, 2. sum up, 3. activate]

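
The three steps named in the figure can be written out for one unit. This is a minimal sketch: the sigmoid activation and the example numbers are assumptions made for illustration, and `neuron` is a hypothetical helper name.

```python
import math


def neuron(inputs, weights, bias):
    weighted = [w * x for w, x in zip(weights, inputs)]   # 1. weigh
    total = sum(weighted) + bias                          # 2. sum up (plus bias)
    return 1.0 / (1.0 + math.exp(-total))                 # 3. activate (sigmoid)


def layer(inputs, weight_rows, biases):
    """A layer is the same simple unit repeated with different weights."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


out = neuron([1.0, 0.0], [2.0, -3.0], -2.0)
hidden = layer([1.0, 0.0], [[2.0, -3.0], [0.0, 1.0]], [-2.0, 0.0])
print(out, hidden)
```
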
Advantage of Having More Parameters (e.g., Weights) to Play With... (without exploding computations...)


[Figure: two plots labelled f(z)]

The Power of Bases (e.g., Features) and Their Weighted Combinations...


[Figure labels: Sparse Coding, Top-down term, Max-margin Model, Objective, Over-segmentation, Bottom-up term]


Q. Can we write functions as weighted sums of sinusoids (features)?


Ingredient (sinusoid frequency)    Amount (scaling)
f1                                 1
f2                                 0.5

Add all.


[Figure: a signal over time and its scaling-versus-frequency picture, linked by the Fourier transform and the inverse Fourier transform]

If we pick the right basis/features, the problem could become very sparse!
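
The "ingredients and amounts" picture can be checked with a discrete Fourier transform. The sketch below is illustrative only: the frequencies f1 = 3 and f2 = 7 cycles per window and the window length N = 64 are assumed values, and a naive O(N²) DFT is used so the example needs nothing beyond the standard library. Of the 33 frequency bins, only two carry any weight.

```python
import cmath
import math

N = 64
f1, f2 = 3, 7  # hypothetical frequencies, in cycles per window
signal = [
    1.0 * math.sin(2 * math.pi * f1 * n / N) + 0.5 * math.sin(2 * math.pi * f2 * n / N)
    for n in range(N)
]


def dft_amplitudes(x):
    """Amplitude per frequency bin 0..N/2 (scaling is exact for 0 < k < N/2)."""
    n_samples = len(x)
    amps = []
    for k in range(n_samples // 2 + 1):
        X_k = sum(
            x[n] * cmath.exp(-2j * math.pi * k * n / n_samples) for n in range(n_samples)
        )
        amps.append(2 * abs(X_k) / n_samples)
    return amps


amps = dft_amplitudes(signal)
significant = {k: round(a, 3) for k, a in enumerate(amps) if a > 1e-6}
print(significant)  # frequency bin -> amount
```

In the time domain all 64 samples are needed to describe the signal; in the sinusoid basis two numbers suffice.
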
Now to Our Second Question...


• In our learning of mathematics, have we seen
“mathematics that learns” before?


Roots of a Polynomial...


Analytical
(Direct Solution)


Solve $x^3 + 4x + 8 = -2x^2$ analytically (symbolically).

$x^3 + 2x^2 + 4x + 8 = 0$          (standard form)
$x^2(x + 2) + 4(x + 2) = 0$        (factor by grouping)
$(x^2 + 4)(x + 2) = 0$
$x^2 + 4 = 0$ or $x + 2 = 0$       (zero product property)
$x^2 = -4$ or $x = -2$
$x = \pm 2i$ or $x = -2$

So if we can factor into linear and quadratic factors, we can find the exact
values of all real and complex roots.
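
The three roots obtained by factoring can be verified by direct substitution, using Python's built-in complex numbers. This is only a sanity check of the derivation above; the function names are chosen for this sketch.

```python
def p(x):
    """The cubic in standard form: x^3 + 2x^2 + 4x + 8."""
    return x**3 + 2 * x**2 + 4 * x + 8


def factored(x):
    """The same cubic after factoring by grouping."""
    return (x**2 + 4) * (x + 2)


roots = [-2, 2j, -2j]
residuals = [abs(p(r)) for r in roots]
print(residuals)

# The factored and standard forms agree at a few arbitrary test points
agreement = [abs(p(x) - factored(x)) for x in (-1.5, 0.0, 3.0, 1 + 1j)]
print(agreement)
```
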
Analytical (Direct Solution) vs. Algorithmic (Iterative/“Learning” Solution)


Root Approximation: Bisection


Calculate a and b (chosen so that f(a) and f(b) have opposite signs)

Procedure bisection method (a, b, ε)
    c ← (a + b) / 2
    While |a − b| > ε and f(c) ≠ 0 do
        If f(a) × f(c) < 0 then      (a negative value indicates that a and c are on opposite sides of the root)
            b ← c
        Else
            a ← c
        c ← (a + b) / 2
    Return c


Which one is easier to teach a machine? (hint: the one with simple
repeated steps).

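
The bisection procedure translates almost line for line into code. The sketch below applies it to the cubic from the analytical example; the starting bracket [-3, 0], the tolerance and the iteration cap are assumptions made for illustration.

```python
def bisection(f, a, b, eps=1e-10, max_iter=200):
    """Find a root of f in [a, b], given that f(a) and f(b) have opposite signs."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have strictly opposite signs")
    c = (a + b) / 2
    for _ in range(max_iter):
        if abs(a - b) <= eps or f(c) == 0:
            break
        if f(a) * f(c) < 0:
            b = c  # the root lies between a and c
        else:
            a = c  # the root lies between c and b
        c = (a + b) / 2
    return c


def p(x):
    return x**3 + 2 * x**2 + 4 * x + 8


root = bisection(p, -3.0, 0.0)
print(root)  # close to the real root x = -2
```

Only the real root is found this way; the complex pair ±2i is out of reach of a sign-change search on the real line.
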
A smarter way would be to use knowledge of the function's local
behavior (e.g., gradient) to plan the next step.


[Figure: side-by-side animations titled “Root Approximation: Bisection” and “Root Approximation: Newton Method”, run on the same equation]
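
Newton's method is the standard way of using the gradient to choose the next guess. The sketch below is an illustration under stated assumptions: the equation shown in the slide's animation is not legible here, so the cubic from the analytical example is reused, with a starting guess of -3 chosen arbitrarily.

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration x <- x - f(x) / f'(x); returns the root and all iterates."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        step = f(x) / df(x)
        x = x - step
        history.append(x)
        if abs(step) < tol:
            break
    return x, history


def p(x):
    return x**3 + 2 * x**2 + 4 * x + 8


def dp(x):
    return 3 * x**2 + 4 * x + 4  # derivative of p; always positive, so no division by zero


root, history = newton(p, dp, x0=-3.0)
print(root, len(history) - 1, "iterations")
```

Bisection needs a few dozen halvings to reach a similar accuracy from a comparable starting interval; Newton's method gets there in a handful of steps because each step uses local slope information.
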
Finding Optima... Analytical (Direct Solution)


$$f(x) = x \sin(x^2) + 1$$

[Figure: plot of f(x) with the point A = (-2, 2.51) marked, where $f'(-2) = -5.99$]


Find derivative, and solve for x

$$f'(x) = \sin(x^2) + 2x^2 \cos(x^2) = 0$$


$$f(x, y) = -x^4 + 4(x^2 - y^2) - 3$$


Find gradient

$$\nabla f = \begin{bmatrix} \dfrac{\partial}{\partial x}\left(-x^4 + 4(x^2 - y^2) - 3\right) \\ \dfrac{\partial}{\partial y}\left(-x^4 + 4(x^2 - y^2) - 3\right) \end{bmatrix} = \begin{bmatrix} -4x^3 + 8x \\ -8y \end{bmatrix}$$


Solve system of equations $\nabla f = 0$ for x and y


Imagine having to do so for a function of thousands or
millions of variables!


[Cartoon, 2008 © Courtney Gibbons: “Well, Doctor?” “I'm sorry. I'm afraid his condition is critical.”]

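
The numbers quoted on this slide can be checked numerically. The sketch assumes the plotted function is f(x) = x sin(x²) + 1, which is the form consistent with the stated derivative and with the marked point A = (-2, 2.51). The critical points listed for the two-variable example follow from solving -4x³ + 8x = 0 and -8y = 0 by hand.

```python
import math


def f(x):
    return x * math.sin(x**2) + 1


def f_prime(x):
    return math.sin(x**2) + 2 * x**2 * math.cos(x**2)


def grad_f2(x, y):
    """Gradient of f(x, y) = -x^4 + 4(x^2 - y^2) - 3."""
    return (-4 * x**3 + 8 * x, -8 * y)


# Central-difference check of the analytical derivative at x = -2
h = 1e-6
numeric = (f(-2 + h) - f(-2 - h)) / (2 * h)
print(round(f(-2), 2), round(f_prime(-2), 2), round(numeric, 2))

# Candidate critical points from -4x^3 + 8x = 0 and -8y = 0
critical_points = [(0.0, 0.0), (math.sqrt(2), 0.0), (-math.sqrt(2), 0.0)]
gradient_norms = [math.hypot(*grad_f2(x, y)) for x, y in critical_points]
print(gradient_norms)
```
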
Finding Optima... Algorithmic (Iterative/“Learning” Solution)


Gradient Descent


[Figure: loss J plotted against weight W, showing the initial weight, the updated weight, and the minimum point of the cost function]

Initial weight: $w_{old}$
Learning rate: $\alpha$
Loss: $J$
New weight:

$$w_{new} = w_{old} - \alpha \frac{dJ}{dW}$$


Algorithm 2: Gradient Descent
input : f: R^n → R, a differentiable function
        x^(0), an initial solution
output: x*, a local minimum of the cost function f.

begin
    k ← 0;
    while STOP-CRIT and (k < k_max) do
        x^(k+1) ← x^(k) − α^(k) ∇f(x^(k));
            with α^(k) = arg min over α ∈ R+ of f(x^(k) − α ∇f(x^(k)));     (also approximated numerically!)
        k ← k + 1;
    return x^(k)

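
A bare-bones version of the update rule is sketched below. It departs from Algorithm 2 in one labeled way: a fixed learning rate α = 0.05 is assumed instead of the line search. As a test problem it minimizes J(x, y) = x⁴ − 4x² + 4y² + 3, the negative of the two-variable function from the analytical slide, so that its maxima become minima; the starting point is arbitrary.

```python
def gradient_descent(grad, x0, alpha=0.05, tol=1e-10, k_max=10_000):
    """Repeat x <- x - alpha * grad(x) until the gradient is tiny or k_max is hit."""
    x = list(x0)
    k = 0
    while k < k_max:
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:  # stopping criterion
            break
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        k += 1
    return x, k


def grad_J(point):
    """Gradient of J(x, y) = x^4 - 4x^2 + 4y^2 + 3."""
    x, y = point
    return [4 * x**3 - 8 * x, 8 * y]


minimum, steps = gradient_descent(grad_J, x0=[1.0, 1.0])
print(minimum, steps)  # approaches (sqrt(2), 0)
```

Starting from x = -1 instead would lead to the mirror-image minimum at (-√2, 0), which is one reason the choice of initialization matters.
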
We could try multiple initializations to
avoid local minima.
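
The effect of trying several starting points can be seen on a one-dimensional loss with two valleys. Everything in this sketch is an assumption made for illustration: the loss J(w) = w⁴ − 3w² + w, the fixed learning rate, and the fixed list of starting points (used instead of random draws so that the run is repeatable).

```python
def J(w):
    return w**4 - 3 * w**2 + w


def dJ(w):
    return 4 * w**3 - 6 * w + 1


def descend(w, alpha=0.01, n_steps=1000):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(n_steps):
        w -= alpha * dJ(w)
    return w


starts = [-2.0, -0.5, 0.5, 2.0]
finals = [descend(w0) for w0 in starts]
best = min(finals, key=J)  # keep the run that reached the lowest loss

for w0, w in zip(starts, finals):
    print(f"start {w0:+.1f} -> w = {w:+.4f}, J = {J(w):+.4f}")
print("best:", best)
```

Runs that start on the right-hand side settle in the shallower valley; keeping the best of several runs recovers the deeper one.
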
Guess Who...? “Learning” Probabilities


BAYES RULE:

$$P(A|B) = P(A) \times \frac{P(B|A)}{P(B)}$$

- P(A|B): learned/updated probability of the event given new data/information
- P(A): initial probability of an event (assumed or based on past data)
- P(B|A) / P(B): statistics of the new data and its statistical relevance to the event of interest

[Figure: iterative “learning”, with the prior probability passed through Bayes' theorem each time new data arrives]


Illustration: Guessing the Pet

Problem Statement: Is the pet in the box a cat or a dog?
Initial Belief: 50/50 chances, i.e., P(Cat) = 0.5 and P(Dog) = 0.5

We have a CLUE: the pet is very quiet.

    P(Quiet | Cat) = 80% or 0.8
    P(Quiet | Dog) = 30% or 0.3

From Bayes Theorem, the probability that the pet is a cat given it is quiet is:

$$P(Cat | Quiet) = \frac{P(Quiet | Cat) \times P(Cat)}{P(Quiet)}$$
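
Plugging in the numbers gives P(Quiet) = 0.8 × 0.5 + 0.3 × 0.5 = 0.55, so P(Cat | Quiet) = 0.4 / 0.55 ≈ 0.727. The sketch below carries out the same update in code. The second update is a hypothetical extension, assuming another independent "quiet" observation, to show the iterative flavour; the helper name `posterior` is chosen for this sketch.

```python
def posterior(prior, likelihood):
    """Bayes update over a set of hypotheses given one piece of evidence."""
    # P(evidence), by the law of total probability
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


prior = {"cat": 0.5, "dog": 0.5}
p_quiet_given = {"cat": 0.8, "dog": 0.3}

after_one_clue = posterior(prior, p_quiet_given)
print(after_one_clue)

# Yesterday's posterior becomes today's prior (hypothetical second observation)
after_two_clues = posterior(after_one_clue, p_quiet_given)
print(after_two_clues)
```
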


The Good, the Bad, and the Ugly... 


ML Concerns, Limitations, and Open Problems 


Data Acquisition 


High Susceptibility to Errors & Biases 
Heavy Reliance on Data Quality 


Concerns of Data Privacy 
Investment of Time & Resources 


Ethical Concerns 


Difficulties in Interpretation 

1. They Can be an Overkill...


[Cartoon: “Tell me the truth. I'm... I'm ready to hear it.” “You don’t need machine learning for that.”]

2. Data Hungry, Hardware Hungry, Power Hungry...


[News headline: FEEDING THE AI BEAST. “Power-hungry AI is putting the hurt on global electricity supply. Data centers are becoming a bottleneck for AI development.” Camilla Hodgson, Financial Times, 4/17/2024]


Even on a small organizational scale,
reliable training requires a good deal of
reliable data.

3. Do Not Capture Causal Relations...


ML generally works on statistical relations (CORRELATION).

[Cartoon examples of advice drawn from correlation alone: “Stop eating ice-cream!”, “Don’t post your summer travel plans on Facebook!”]

But causal relations (CAUSATION) are often needed to truly
understand and work on problems.

4. Bias, Reproducibility, and Verifiability...


Hard to identify ethical biases in training data and in
results provided by a big ML algorithm.

Example: Biased “unchallengeable” decisions could worsen economic disparity.

[Cartoon: “AI Model: No Loan For You”; “What did you train the AI on?”]


A Survey on Bias and Fairness in Machine Learning
NINAREH MEHRABI, FRED MORSTATTER, NRIPSUTA SAXENA,
KRISTINA LERMAN, and ARAM GALSTYAN, USC-ISI


Reuters: “Insight - Amazon scraps secret AI recruiting tool that showed bias against women”

- ML was trained to find “good CVs” using CVs of
highly successful people in Silicon Valley
(which are mostly men - for reasons other than
competence).

- It started rejecting CVs with “feminine” language.


Hard to verify and reproduce operation of large models. Reproducibility:

- Facilitates collaboration and review processes
- Ensures continuity of work and knowledge exchange
- Provides opportunity for future evaluations
- Verification of results for hidden biases can help explain them


Nature, NEWS FEATURE, 05 December 2023: “Is AI leading to a reproducibility crisis in science?”
Scientists worry that ill-informed use of artificial intelligence is driving a deluge of ...

5. Black Box - Explainability Issues...


ML learned “features” are not always aligned
with what we consider features.


[Figure: Training Data → Learning Process → Learned Function → Output (“It's an apple!”, “This is a cat (p = .93)”) → User with a Task]

The user is left asking:
* Why did you do that?
* Why not something else?
* When do you succeed?
* When do you fail?
* When can I trust you?
* How do I correct an error?

6. Vulnerabilities


Deep ML, in particular, is prone to adversarial
attacks and errors in data.

[Figure: an image classified as “panda” (57.7% confidence), plus a small amount of noise, is classified as “gibbon” (99.3% confidence)]

[Figure: Adversary/Attacker acting on the pipeline Training Set → Training Model → Tests → Wrong Output]


“To err is human, but to
really foul things up you
need a computer.”

- Paul Ehrlich

7. [Some] Open Problems


• How to make the learning process more efficient?
  - Work with less data, smaller models (pruning), fewer hyper-parameters
• How can we include learning of causal relations?
  - New field: Causal ML
• How to best include any known physical laws (PDEs etc.) in
the training process?
  - New field: Physics Informed Neural Networks (PINNs)
• How to make ML reproducible, verifiable, and bias-free?
• How to make ML more explainable?
  - New field: Explainable AI (XAI)
• How to reduce vulnerabilities and defend against
attacks?


Questions?? Thoughts??